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In this paper, we consider the multivariate Bernoulli distribution as a model to estimate the 
structure of graphs with binary nodes. This distribution is discussed in the framework of the 
exponential family, and its statistical properties regarding independence of the nodes are demon- 
strated. Importantly the model can estimate not only the main effects and pairwise interactions 
among the nodes but also is capable of modeling higher order interactions, allowing for the 
existence of complex clique effects. We compare the multivariate Bernoulli model with existing 
graphical inference models-the Ising model and the multivariate Gaussian model, where only 
the pairwise interactions are considered. On the other hand, the multivariate Bernoulli distri- 
bution has an interesting property in that independence and uncorrelatedness of the component 
random variables are equivalent. Both the marginal and conditional distributions of a subset of 
variables in the multivariate Bernoulli distribution still follow the multivariate Bernoulli distri- 
bution. Furthermore, the multivariate Bernoulli logistic model is developed under generalized 
linear model theory by utilizing the canonical link function in order to include covariate infor- 
mation on the nodes, edges and cliques. We also consider variable selection techniques suchas 
LASSO in the logistic model to impose sparsity structure on the graph. Finally, we discuss ex- 
tending the smoothing spline ANOVA approach to the multivariate Bernoulli logistic model to 
enable estimation of non-linear effects of the predictor variables. 

Keywords: Bernoulli Distribution, Generalized Linear Models, LASSO, Smoothing Spline. 

1. Introduction 

Undirected graphical models have been proved to be useful in a variety of applications in 
statistical machine learning. Statisticians and computer scientists devoted resources to 
studies in graphs with nodes representing both continuous and discrete variables. Such 
models consider a graph G — (V, E), whose nodes set V represents K random variables 
Y%, Y2, . . . , Yk connected or disconnected defined by the undirected edges set E. This 
formulation allows pairwise relationships among the nodes to be described in terms of 
edges, which in statistics are defined as correlations. The graph structure can thus be 
determined under the independence assumptions on the random variables. Specifically, 
variables and Yj are conditionally independent given all other variables if the asso- 
ciated nodes are not linked by an edge. Two important types of graphical models are 
the Gaussian model, where the K variables are assumed to follow a joint multivariate 
Gaussian distribution, and the Markov model, which captures the relationships between 
categorical variables. 

However, the assumption that only the pairwise correlations among the variables are 
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considered may not be sufficient. When the joint distribution of the nodes is multivariate 
Gaussian, the graph structure can be directly inferred from the inverse of the covariance 
matrix of the random variables and in recent years a large body of literature has emerged 
in this area for high dimensional data. Researchers mainly focus on different sparse struc- 
ture of the graphs or, in other words, the covariance matrix for high-dimensional obser- 
vations. For example, [11] proposes a consistent approach based on LASSO from [16] to 
model the sparsity of the graph. Due to the fact that the Gaussian distribution can be 
determined by the means and covariance matrix, it is valid to consider only the pair- 
wise correlations, but this may not true for some other distributions. The multivariate 
Bernoulli distribution discussed in [20], which will be studied in Section 3, has a probabil- 
ity density function involving terms representing third and higher order moments of the 
random variables, which is also referred to as clique effects. To alleviate the complexity 
of the graph, the so-called Ising model borrowed from physics gained popularity in the 
machine learning literature. [19] introduces several important discrete graphical models 
including the Ising model and [1] discussed a framework to infer sparse graph structure 
with both Gaussian and binary variables. In this paper, higher than second interactions 
among a group of binary random variables are studied in detail. 

What's more, in some real applications, people are not only interested in the graph 
structure but also want to include predictor variables that potentially have influence on 
the graph structure. [6] considers a multivariate Bernoulli model which uses a smoothing 
spline AN OVA model to replace the linear predictor ([10]) for main effects on the nodes, 
but set the second and higher order interactions between the nodes as constants. Higher 
order outcomes with hierarchical structure assumptions on the graph involving predictor 
variables are studied in [4]. 

This paper aims at building a unified framework of a generalized linear model for 
the multivariate Bernoulli distribution which includes both higher order interactions 
among the nodes and covariate information. The remainder is organized as follows. Sec- 
tion 2 starts from the simplest multivariate Bernoulli distribution, the so-called bivariate 
Bernoulli distribution, where there are only two nodes in the graph. The mathematical 
formulation and statistical properties of the multivariate Bernoulli distribution are ad- 
dressed in Section 3. Section 4 serves to get a better understanding of the differences 
and similarities of the multivariate Bernoulli distribution with the Ising and multivariate 
Gaussian models. Section 5 extends the model to include covariate information on the 
nodes, edges and cliques, and discusses parameter estimation, optimization and associ- 
ated problems in the resulting multivariate Bernoulli logistic model. Finally Section 6 
provides conclusion of the paper and some proofs are deferred to Appendix. 

2. Bivariate Bernoulli Distribution 

To start from the simplest case, we extend the widely used univariate Bernoulli distribu- 
tion to two dimensions in this section and the more complicated multivariate Bernoulli 
distribution is explored in Section 3. The Bernoulli random variable V, is one with binary 
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outcomes chosen from {0, 1} and its probability density function is 

f Y ( y )= p y(i-p) 1 -y. 

Next, consider bivariate Bernoulli random vector (Yi, Y2), which takes values from (0, 0), 
(0, 1), (1,0) and (1, 1) in the Cartesian product space {0, l} 2 = {0, 1} x {0, 1}. Denote 
Pij = P{Y\ = i,Y2 = j), i,j = 0, 1, then its probability density function can be written 

as 

p ( Y = y) = p(vx,V2) 

_ „l/lW2_Wl(l-y2) (l-2/l)2/2„(l-2/l)(l-2/2) fry -,\ 

— P11 Pw P01 Poo \ Z - L ) 

Pw\ , , f Poi\ , , /puPoo\ 

2/12/2 log 

\P10P01 J 



cxp <^ log(p o) + 2/1 log — +2/2 log 

\P00J \Poo 



where the side condition p o +P10 + P01 + P11 = 1 holds to ensure it is a valid probability 
density function. 

To simplify the notation, define the natural parameters /'s from general parameters 
as follows: 

f 1 = logM, (2.2) 
\P00J 

f = logf^l), (2.3) 
\P00J 

f 12 - bgf^V (2.4) 



VP10P01/ 

and it is not hard to verify the inverse of the above formula 

1 

P °° = 1 + cxp(/i) + cxp(/ 2 ) + cxp(/i + p + /") ' (2 ' 5) 
exp(/ 1 ) 

Pl0= 1 + exp(/i) + cxp(/ 2 ) '+ cxp(/i + p + p*) ' (2 ' 6) 



ex P (/ 2 ) 

1 + cxp(/i) + cxp(/ 2 ) + cxp(/i + p + /1 2 ) ■ 



"01 1 , 1 ^ f2\ 1 nvT ^ f 1 1 f2 1 f 12V \ Z -') 



exp(/ 1 + f 2 + f 12 ) 
PU = 1 + cxp(/i) + cxp(/ 2 ) + cxp(/i + p + ,/ 12 ) ' (2 ' 8) 

Here the original density function (2.1) can be viewed as a member of the exponential 
family, and represented in a log-linear formulation as: 

P(Y =y) = cxp {log(p o) + 2/iZ 1 + 2/2/ 2 + 2/12/2/ 12 } . (2.9) 

Consider the marginal and conditional distribution of Y\ in the random vector (Y\ , Y2) , 
we have 
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Proposition 2.1. The marginal distribution ofY\ in a bivariate Bernoulli vector (Yi, Y2) 
following density function (2.1) is univariate Bernoulli with density 

P(Yi = Vi) = (Pw+Pii) Vl (Poo+Poi) (1 - yi) . (2.10) 

What 's more, the conditional distribution of Y\ given Yi is also univariate Bernoulli with 
density 

1 1 " ' \p(l,y2)+p(0,y 2 )J \p(l,y2)+p(0,y 2 )J 1 ; 

The proposition implies that the bivariate Bernoulli distribution is similar to the 
bivariate Gaussian distribution, in that both the marginal and conditional distributions 
are still Bernoulli distributed. On the other hand, it is also important to know under 
what conditions the two random variables Y\ and Y2 are independent. 

Lemma 2.1. The components of the bivariate Bernoulli random vector (Yi,!^) are 
independent if and only if f 12 in (2.9) and defined in (2.4) is zero. 

The Lemma 2.1 is a special case for Theorem 3.1 in Section 3, and the proof is 
attached in Appendix. It is not hard to see from the log-linear formulation (2.9) that 
when f 12 = 0, the probability density function of the bivariate Bernoulli is separable in 
yi and y2 so the lemma holds. In addition, a simple calculation of covariance between Y\ 
and Y2 gives 

cov(Yi,r 2 ) = E[Yx - ( Pll + p 10 )}{Y 2 - (pn + P01)} 

= P11P00 -P01P10, (2.12) 

and using (2.4), the disappearance of f 12 indicates that the correlation between Y\ and 
Y2 is null. When dealing with the multivariate Gaussian distribution, the uncorrelated 
random variables are independent as well and Section 3 below shows uncorrelatedness 
and independence is also equivalent for the multivariate Bernoulli distribution. 

The importance of Lemma 2.1 was explored in [20] where it was referred to as propo- 
sition 2.4.1. The importance of f 12 (denoted as u-terms) is discussed and called cross- 
product ratio between Y\ and Y2. The same quantity is actually log odds described for 
the univariate case in [10] and for the multivariate case in [9]. 



3. Formulation and Statistical Properties 
3.1. Probability Density Function 

As discussed in Section 2, the two dimensional Bernoulli distribution possesses good 
properties analogous to the Gaussian distribution. This section is to extend it to high 
dimensions and construct the so-called multivariate Bernoulli distribution. 
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Let Y = (Yi, Y2, ■ • ■ ,Yk) be a if -dimensional random vector of possibly correlated 
Bernoulli random variables (binary outcomes) and let y = (j/i, . . . ,v,k) be a realization 
of Y. The most general form p(y±, . . . , yx) of the joint probability density is 



P{Y 1 =y ll Y 2 =y 2 ,...,Y K = y K ) 



■■,Vk) 

i0 )in" 1 ( 1 -*)] 



p(yi,V2, ■ 
p(0,0,... 
p(l,0,...,0) 

p(o J i,... > o) K1 - to)wa njL,( 1 -»i)] 

.../^l.l lill" v . 



or in short 



= Po,0,...,0 Pl.0,-,0 /'.:•.! ■■■!>. A I 

To simplify the notation, denote the quantity S to be 



ghh—ir 



E f js + E /**+•••+/■ 



J1.72---.7t- 



(3.1) 



(3.2) 



K«t<r 



and in the bivariate Bernoulli case S* 12 = f 1 + f 2 + f 12 . To eliminate the product in the 
tedious exponent of (3.1), define the interaction function B 



(3.3) 



so correspondingly in the bivariate Bernoulli distribution for the realization (2/1,2/2) of 
random vector (Yi, Yj), the interaction function of order 2 is B 12 (y) = y\y<i- This is the 
only available order two interaction for the bivariate case. In general, there are = 
i£i^zll different second interactions among the binary components of the multivariate 
Bernoulli random vector. 

The log-linear formulation of the multivariate Bernoulli distribution induced from 
(3.1) is 



%,f) 



log[p(y)] 



K 



r=l \l<Ji<j2<...<ir<A' 



(3.4) 



where f = (f 1 , f 2 , . . . , f 12 — K y r is the vector of the natural parameters for multivariate 
Bernoulli, and the normalizing factor 6(f) is defined as 



K 



6(f) = log E 



E exp[S^ 2 ->] 

>l<jl<h<-<jr<K 



(3.5) 



As a member of the exponential distribution family, the multivariate Bernoulli distri- 
bution has the fundamental 'link' between the natural and general parameters. 
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Lemma 3.1. (Parameter transformation). For the multivariate Bernoulli model, the 
general parameters and natural parameters have the following relationship. 

exp(f jlh - jr ) = 

[\p(even # zeros among . . . , j r components and other components are all zero) 

[\p(odd # zeros among j\Ji ■ ■ ■ ,j r components and other components are all zero) ' 

where # refers to the number of zeros among the superscript . . .yj r of f . In addition, 

V{ji,h ■■■Jr positions are one, others are zero) 

cxp S J1J2 "> = 77-^ ^ (3-6) 

p(0,0, ... ,0) 

and conversely the general parameters can be represented by the natural parameters 

P\Ji>]2 ■ ■ ■ , Jr positions are one, others are zero) = — (3-7) 

exp(b(f)) 

Based on the log-linear formulation (3.4) and the fact that the multivariate Bernoulli 
distribution is a member of the exponential family, the interactions functions B li:l2 "' Jr (y) 
for all combinations j\]2 ■ ■ ■ j r define the sufficient statistics. In addition, the log-partition 
function 6(f) as in (3.5) is useful to determine the expectation and variance of the suffi- 
cient statistics to be addressed in later sections. 

3.2. Independence, Marginal and Conditional Distributions 

One of the most important statistical properties for the multivariate Gaussian distribu- 
tion is the equivalence of independence and uncorrelatedness. As a natural multivariate 
extension of the univariate Bernoulli distribution, it is of great interest to explore in- 
dependence among components of the multivariate Bernoulli distribution and it is the 
topic for this section. 

The independence of components of a random vector is determined by separability of 
coordinates in its probability density function and it is hard to get directly from (3.1). 
However, based on the relationship between the natural parameters and the outcome 
in the log-linear formulation (3.4), the independence theorem of the distribution can be 
derived as follows with proof deferred to Appendix. 

Theorem 3.1. (Independence of Bernoulli outcomes) For the multivariate Bernoulli 
distribution, the random vector Y = (Y\, . . . , Yk) is independent element-wise if and only 
*f 

f hh-jr = o, y 1 < ji < j 2 < . . . < Jr < K, r > 2. (3.8) 
In addition, the condition in equation (3.8) can be equivalently written as 

r 

S hh-jr /' . Vr > 2 (3.9) 

k=l 
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The importance of the theorem is to link the independence of components of a random 
vector following the multivariate Bernoulli distribution to the natural parameters. Notice 
that to ensure all the single random variable to be independent of all the others is a strong 
assertion and in graphical models, researchers are more interested in the independence 
of two groups of nodes, so we have the following theorem 

Theorem 3.2. (Independence of Groups) For random vector Y = (Yi,...,Yk) fol- 
lowing the multivariate Bernoulli distribution, without of loss of generality, suppose two 
blocks of nodes Y' = (Yi, Y 2 , . . . , Y r ), Y" = (Y r+1 ,Y r+2 , . . . ,Y S ) with 1 < r < s < K, 
and denote index set T\ = {1, 2, . . . , r} and ti = {r + 1, r + 2, . . . , s}. Then Y' and Y" 
are independent if and only if 



The proof of Theorem 3.2 is also deferred to Appendix. The theorem delivers the 
message that the two groups of binary nodes in a graph are independent if all the natural 
parameters /'s corresponding to the index sets that include indices from both groups 
disappear. 

Furthermore, analogous to the multivariate Gaussian distribution, researchers are in- 
terested in statistical distributions of marginal and conditional distributions for the mul- 
tivariate Bernoulli distribution. Likewise, the multivariate Bernoulli distribution main- 
tains the good property that both the marginal and conditional distributions are still 
multivariate Bernoulli as stated in the following proposition. 

Proposition 3.1. The marginal distribution of the random vector (Yi, . . . , Yk) which 
follows multivariate Bernoulli distribution with density function (3.1) to any order is still 
a multivariate Bernoulli with density 



for some r < K. 

What's more, the conditional distribution o/(Y"i, Y 2 , ■ ■ ■ , Y r ) given the rest is also mul- 
tivariate Bernoulli with density 



P(Y 1 = Vl ...,Y r = y r \Y r+1 = y r+1 , . . . ,Y K = y K ) = , 1 ""^ A "\ . (3.12) 



f T = 0, V r n n ^ and t n t 2 ^ 



(3.10) 




(3.11) 



P(Vr+l, ■■■ 1 Vk) 
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3.3. Moment Generating Functions 



The moment generating function for the multivariate Bernoulli distribution is useful 
when dealing with moments and proof of Theorem 3.1. 



E [exp {nxYt + pl 2 Y 2 + ■■■ + hkY k )} 
Poo...oe° + Pio.-.o^ 1 + • • ■ + Pn...ie^ + ^- + - + ^ 
cxp'.S'-'- 



= E E 

<"=1 3l<3'I<---<3r 



exp [6(f)] 



■ exp 



E^ 

.fc=i 



(3.13) 



Hence, from the formula the moment generating function is solely determined by the S 
functions, which are the transformation of the natural parameters /'s. 



3.4. Gradient and Hessian 

As a member of the exponential family, the gradient and Hessian (Fisher information) 
are the mean and covariance of the random vector (Y±, Y 2l . . . , Yk). Therefore, they are 
important in statistics but also crucial for model inference when the proper optimiza- 
tion problem is established. To examine the formulation of gradient and Hessian for the 
logarithm of the multivariate Bernoulli distribution (3.1), let us define some notations. 

Denote T to be the set of all possible superscripts of the /'s including the null super- 
script with /" = 0, so it has 2 K elements. In other words, T is the power set of indices 
{1,2,..., K}. Let | • | be the cardinality of a set then \T\ = 2 K . We can define the relation 
subset C for n, t 2 G T as follows. 

Definition 3.1. For any two superscripts t± = {ji, j 2 , ■ ■ ■ , j r } such that n £ T and 
7i = {ki, &2; • • • j k s } with t 2 € T and r < s, we say that t\ C t 2 if for any j G n, there 
is a k G t 2 such that j = k. 

Based on the definition, the S"s in (3.2) can be reformulated as 

specifically, S* = 0. Consider the gradient of the log-linear form (3.4) with respect to the 
f's, for any r <E T, 

E T > exp[S T0 ] 
= -B T {y)+ °-; (f) • (3-15) 
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The derivation of partial derivative of b with respect to f T in (3.15) is 

db{{) 1 dexp[6(f)] 

~W ~ exp[6(f)] ' dp 

1 9E T „ er exp[51 



exp[6(f)] 3/* 
E To p r exp[^°] 
exp[6(f)] 

£[£ T ( y )], 



(3.16) 



and the result can also be derived from the moment generating function (3.13) by taking 
derivatives with respect to the /i's. 

A simple example of (3.15) in the bivariatc Bernoulli distribution (2.9) is 

dl(y,f) _ , exp^+exp^ 12 ) 



df 1 ~ Vl ' 6(f) 

Further, the general formula for the second order derivative of (3.4) with respect to 
any two natural parameters f Tl and f T2 is 

d 2 l(yj) d 2 b(i) 



d /E^exp^ 



df^ V exp[6(f)] 
E T0 3 T1 , To2T2 eMS T °] expjbffl] - exp[S^] E To > 2 exp[^°] 

exp [26(f)] 

= cov (B^(y),B^(y)). (3.17) 

In the bivariate Bernoulli distribution, 

d 2 l(y,f) = cxp[S 12 ] exp[b(f)] - (expt/ 1 ] +exp[^ 12 ])(exp[/ 2 ] +exp[5 12 ]) 
dpdf 2 exp[26(f)] 



4. The Ising and the Multivariate Gaussian Models 

As mentioned in Section 1, the Ising and the multivariate Gaussian distributions are 
two main tools to study undirected graphical models, and this section is to compare the 
multivariate Bernoulli model introduced in Section 3 with these two popular models. 



4.1. The Ising Model 



The Ising model, which originated from [8] , becomes popular when the graph structure is 
of interest with nodes taking binary values. The log-linear density of the random vector 
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(Yi,...,Y K ) is 

K 

io g [/(y 1 ,...,y K )] = ^^^+ £ e 3 , r Y 3 Y r -\o g [z(Q)], (41) 

j=l \<j<j'<K 

where = (9jJ')kxk is a symmetric matrix specifying the network structure, but it is 
not necessarily positive semi-definite,. The log-partition function Z{Q) is defined as 

Z(0)= £ exp (^e^Yj + J2 OU'WA, (4-2) 

Yje{0.1},l<j<K \j=l l<j<j'<K J 

and notice that it is not related to Yj due to the summation over all possible values of 
Yj for j = 1,2,..., K. 

It is not hard to see that the multivariate Bernoulli is an extension of the Ising model, 
which assumes all S T = for any t such that |t| > 2 and 9jji = . In other words, in 
the Ising model, only pairwise interactions are considered. [13] pointed out that the higher 
order interactions, which is referred to as clique effects in this paper, can be converted 
to pairwise ones through the introduction of additional variables and thus retain the 
Markovian structure of the network defined in [19]. 

4.2. Multivariate Gaussian Model 

When continuous nodes are considered in a graphical model, the multivariate Gaussian 
distribution is important since, similar to the Ising model, it only considers interactions 
up to order two. The log-linear formulation is 

\o g [f(Y 1 ,...,Y K )} = (-i(y-^) T £(y- M )) -log[Z(£)], (4.3) 

where Z(Yi) is the normalizing factor which only depends on the covariance matrix E. 

4.3. Comparison of Different Graphical Models 

The multivariate Bernoulli (3.4), Ising (4.1) and multivariate Gaussian (4.3) are three 
different kinds of graphical models and they share many similarities 

1. All of them are members of the exponential family. 

2. Uncorrelatedness and independence are equivalent. 

3. Conditional and marginal distributions maintain the same structure. 

However, some differences do exist, the multivariate Bernoulli and the Ising models 
both serve as tools to model graph with binary nodes, and are certainly different from 
the multivariate Gaussian model which formulates continuous variables. In addition, 
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the multivariate Bernoulli specifics clique effects among nodes whereas the Ising model 
simplifies to deal with only pairwise interactions and the multivariate Gaussian essentially 
is uniquely determined by its mean and covariance structure, which is also based on first 
and second order moments. Table 1 illustrates the number of parameters needed to 
uniquely determine the distribution for these models as the number of nodes K in the 
graph increases. 



Graph dimension 


multivariate Bernoulli 


Ising 


multivariate Gaussian 


1 


1 


1 


2 


2 


3 


3 


5 


3 


7 


6 


9 


K 


2 K - 1 


K(K+1) 
2 


K+ K(K+1) 



Table 1. The number of parameters in the multivariate Bernoulli, the Ising and the multivariate 

Gaussian models. 



5. Multivariate Bernoulli Logistic Models 
5.1. Generalized Linear Model 

As discussed in Section 3, the multivariate Bernoulli distribution is a member of the 
exponential family and as a result, the generalized linear model theory in [10] applies. 
The natural parameters (f's) in Lemma 3.1 can be formulated as a linear predictor in 
[10] such that for any r G T with T ={1,2,..., K} 



f T (x) = 4 + c[xi H hc^p, 



(5.1) 



where the vector c r = (cq, . . . , c£) for r G T is the coefficient vector to be estimated 
and x = (.Ti, X2, ■ ■ ■ , x p ) is the observed covariate. Here p is the number of variables and 
there are 2 K — 1 coefficient vectors to be estimated so in total p x (2 K — 1) unknown 
parameters. (5.1) is built on the canonical link where natural parameters are directly 
modeled as linear predictors, but other links are possible and valid as well. 

When there are n samples observed from a real data set with outcomes denoted as 
y(i) = (yi(i), . . . , yx (i)) and predictor variables x(i) = (x\(i), . . . ,x p {i)), the negative 
log likelihood for the generalized linear model of the multivariate Bernoulli distribution 



l(y,i(x)) = J2 



i=l 



^f T W))B T (yM) + b(i(x)) 



(5.2) 



where, similar to (3.5) the log partition function b is 



b(f (ar)) = log 



eMS T {x(i))} 
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When dealing with the univariate Bernoulli distribution using formula (5.2), the re- 
sulting generalized linear model corresponding to the multivariate Bernoulli model is the 
same for logistic regression. Thus the model is referred to as the multivariate Bernoulli 
logistic model in this paper. 



5.2. Gradient and Hessian 



To optimize the negative log likelihood function (5.1) with respect to the coefficient vector 
c r , the efficient and popular iterative re- weighted least squares algorithm mentioned in 
[10] can be implemented. Nevertheless, the gradient vector and Hessian matrix (Fisher 
Information) with respect to the coefficients c T are still required. 

Consider any r G T, the first derivative with respect to cj in the negative log likelihood 
(5.2) of the multivariate Bernoulli logistic model, according to (3.15) and ignoring index 
i, is 



dl(y,f) 



dljyj) dp 



E 



-B T (y) + 



E ro3r exp[^Q(x)] 
exp[&(f(aO)] 



(5.3) 



Further, the second derivative for any two coefficients c,, 1 and c? is 



d 2 l(yj) 



d fdl(y,f)df 



dej 1 V df* dc T k 2 

df2 d 2 l(yj) dn 
dc] 1 df^df^ del 2 

d 2 i(y,f) 

E T0 D Tl)T ,D T2 exp[^ 



- Xj Xk 



exp[6(/(x))] 
E Tn 3 Tl cxp[^(x)] E, 3 T2 e X p[S T0 (x)} 
cxp[26(/(x))] 



XjX}» 



(5.4) 



5.3. Parameters Estimation and Optimization 

With gradient (5.3) and Hessian (5.4) at hand, the minimization of the negative log 
likelihood (5.2) with respect to the coefficients c T can be solved with Newton-Raphson 
or the Fisher's scoring algorithm (iterative re-weighted least squares) when the Hessian 
is replaced by the Fisher information matrix. Therefore, in every iteration, the new step 
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size for current estimate is computed as 



Ac 



d 2 l(yj) 



dc?dc? 



dl(y,f) 



del 



(5.5) 



The process continues until the convergence criterion is met. 



5.4. Variable Selection 

Variable selection is important in modern statistical inference. It is also crucial to select 
only the significant variables to determine the structure of the graph for better model 
identification and prediction accuracy. The pioneering paper [16] introduced the LASSO 
approach to linear models. Various properties of the method were demonstrated such as 
in [22] and extensions of the model to different frameworks were discussed in [11], [23], 
[12] etc. 

The approach can be extended to the multivariate Bernoulli distribution since it is a 
member of the exponential family. What we have to do is to apply the l\ penalty to the 
coefficients in (5.1), and the target function is 

-, n p 

L x (x,y) = -X>(i),f(s(i))) + £ A r 5>J|, (5.6) 

i=l reT j=l 

where A T are the tuning parameters need to be chosen adaptively. The superscript r 
allows flexibility to have natural parameters with different levels of complexity. For tun- 
ing in penalized regression problems, the randomized generalized approximate cross- 
validation (GACV) designed for smoothing spline models introduced in [21] can be de- 
rived for LASSO problem, such as in [15]. The widely used information criterion AIC and 
BIC can also be implemented, but the degrees of freedom cannot be calculated exactly. [9] 
demonstrates that the number of nonzero estimates can serve as a good approximation in 
the multivariate Bernoulli logistic model. There are several efficient algorithms proposed 
to optimize the problem (5.6), for example, the LASSO-patternsearch introduced in [15] 
can handle large number of unknowns provided that it is known that at most a modest 
number are non-zeros. Recently, [14] has extended the algorithm in [15] to the scale of 
multi-millions of unknowns. Coordinate descent [5] is also proven to be fast in solving 
large p small n problems. 



5.5. Smoothing Spline ANOVA Model 

The smoothing spline model gained popularity in non-linear statistical inference since it 
was proposed in [2] for univariate predictor variables. More importantly, multiple smooth- 
ing spline models for generalized linear models enable researchers to study complex real 
world data sets with increasingly powerful computers as described in [18]. 
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As a member of the exponential family, the multivariate Bernoulli distribution can 
be formulated under smoothing spline ANOVA framework. [6] considers the smoothing 
spline ANOVA multivariate Bernoulli model but the interactions are restricted to be 
constant. However, in general the natural parameters or linear predictors /'s can be 
relaxed to reside in a reproducing kernel Hilbert space. That is to say, for the observed 
predictor vector x, we have 

p-( x ) = rf( x )^ w ith V T e U\ t 6 T, (5.7) 

where TV is a reproducing kernel Hilbert space and the superscript r allows a more 
flexible model such that the natural parameters can come from different reproducing 
kernel Hilbert spaces. Further, W can be formulated to have several components to 
handle multivariate predictor variables, that is 'H T = ©o = q"H^ and details can be found 
in [7]. 

As a result, the rf is estimated from the variational problem 

1 ™ 

l x (x,y) = - X)J(y(i),77(z(i))) + AJ(r/), (5.8) 
i=l 

where r\ is the vector form of 7y T 's. The penalty is seen to be 

AJ( ?? )=A^^ 1 ||P 1 VH 2 (5.9) 

with A and 9 T being the smoothing parameters. This is an over-parameterization adopted 
in [7], as what really matters are the ratios X/6 T . The functional P{ projects function rf 
in W onto the smoothing subspace T-L[. 

By the argument of smoothing spline ANOVA model in [7], the minimizer rf has the 
expression as in [17], 

m n 

rf{x) = WUx) + E c ! RT ( x i, x)> ( 5 - 10 ) 
i/=i i=i 

where {0^}^! is a basis of Hq = 7i T QTil , the null space corresponding to the projection 
functional P{ . R T (-,-) is the reproducing kernel for T-C[. 

The variational problem (5.8) utilizing the smoothing spline ANOVA framework can 
be solved by iterative re- weighted least squares (5.5) due to the linear formulation (5.10). 
More on tuning and computations including software will appear in [3]. 

6. Conclusion 

We have shown that the multivariate Bernoulli distribution, as a member of the expo- 
nential family, is a way to formulate the graph structure of binary variables. It can not 
only model the main effects and pairwise interactions as the Ising model does, but also 
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is capable of estimating higher order interactions. Importantly, the independence struc- 
ture of the graph can be modeled via significance of the natural parameters. The most 
interesting observation of the multivariate Bernoulli distribution is its similarity to the 
multivariate Gaussian distribution. Both of them have the property that independence 
and uncorrelatedness of the random variables are equivalent, which is generally not true 
for other distributions. In addition, the marginal and conditional distributions of a subset 
of variables still follow the multivariate Bernoulli distribution. 

Furthermore, the multivariate Bernoulli logistic model extends the distribution to a 
generalized linear model framework to include effects of predictor variables. Under this 
model, the traditional statistical inferences such as point estimation, hypothesis test and 
confidence intervals can be implemented as discussed in [10]. 

Finally, we consider two extensions to the multivariate Bernoulli logistic model. First, 
the variable selection technique using LASSO can be incorporated to enable finding 
important patterns from a large number of candidate covariates. Secondly, the smoothing 
spline AN OVA model is introduced to consider non-linear effects of the predictor variables 
in nodes, edges and cliques level. 



Appendix A: Proofs 

Proof, of Proposition 2.1 

With the joint density function of the random vector (Yi , Y 2 ), the marginal distribution 
of Y\ can be derived 

P(Y 1 = 1) = P(Y 1 = 1,Y 2 = 0)+P(Y 1 =1,Y 2 = 1) 
= Pio+Pn- 

Similarly, 

P(Yi = 0) =p o +Pn- 

Combining the side condition of the parameters p's, 

P(Yt = 1) + P(Yi = 0) = poo + P01 + P10 + P11 = 1. 

This demonstrates that Yj follows the univariate Bernoulli distribution and its density 
function is (2.1). 

Regarding the conditional distribution, notice that 

^ _„,*-„, . £1™^ 

Poo 



Poo + P10 ' 

and the same process can be repeated to get 
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Hence, it is clear that with condition Y 2 = 0, Y\ follows a univariate Bernoulli distri- 
bution as well. The same scenario can be examined for the condition Y-i = 1. Thus, the 
conditional distribution of Y\ given Y2 is given as (2.11). □ 

Proof, of Lemma 2.1 

Expand the log-linear formulation of the bivariate Bernoulli distribution (2.9) into 
factors 

P(Fi = j/i, Y 2 = y 2 ) = Poo cxp^i/ 1 ) cxp(y 2 / 2 ) exp(yiy 2 / 12 ). (A.l) 

It is not hard to see that when f 12 = 0, the density function (A.l) is separable to two com- 
ponents with only y\ and y 2 in them. Therefore, the two random variables corresponding 
to the formula are independent. Conversely, when Y± and Y 2 are independent, their den- 
sity function should be separable in terms of yi and y 2 , which implies y\y 2 f 12 = for 
any possible values of y\ and y 2 ■ The assertion dictates that f 12 is zero. □ 

Proof, of Lemma 3.1 

Consider the log-linear formulation (3.4), the natural parameters f's are combined 
with products of some components of y. Let us match terms in the fi^—^B 31 " Jr (y) from 
log-linear formulation (3.4) with the coefficient for the corresponding product . . . y^ r 
terms in (3.1). The exponents of p's in (3.1) can be expanded to summations of different 
products B T (y) with r £ T and all the p's with yj 1 . . . yj r in the exponent have effect 
on /J' 1 — > so all the positions other than ji, . . . j r must be zero. Furthermore, those p's 
with positive yj 1 . . .yj r in its exponent appear in the numerator of exp[f J1 '" 3T ] and the 
product is positive only if there are even number of 0's in the positions ji, . . . , j r . The 
same scenario applies to the p's with negative products in the exponents. 

What's more, notice that P00...0 = b(f) and 

exp[^->] = exp[^/^+ + + 

l<s<r l<s<t<r 

= U exp /' ][ cn|) /' ' •••-xp .P - ' (A.2) 

l<s<r l<s<t<r 

and apply the formula for exjp[f J1 '" Jr ] with cancellation of terms in the numerators and 
the denominators. The resulting (3.6) can then be verified. 

Finally, (3.7) is a trivial extension of (3.6) by exchanging the numerator and the 
denominator. □ 



Proof, of Theorem 3.1 

Here, we take use of the moment generating function (3.13) but it is also possible to 
directly work on the probability density function (3.1). The mgf can be rewritten as 

K r 

^ ' • ■ ' = c^mT ^ £ exp[S^->] n cxp [ Mj J . (A.3) 

PI V JJ r=l n < n <...<j r fe=l 
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It is not hard to see that this is a polynomial function of the unknown variables exp((ik) 
for k = 1, . . . , K. The independence of the random variables Y\,Y%, . ■ ■ , Yk is equivalent 
to that (A. 3) can be separated into components of (j,/. or equivalently exp(/ifc). 

(=>) If the random vector Y is independent, the moment generating function should 
be separable and assume the formulation is 

A" 

i>(m, . . . , (i K ) = C~[[(a k + /3 fe cxp[// fc ]), (A.4) 
fe=i 

where a k and fa are functions of parameters S"s and C is a constant. If we expand (A.4) 
to polynomial function of exp[/ifc] and determine the corresponding coefficients, (3.8) and 
(3.9) will be derived. 

(-4=) Suppose (3.9) holds, then we have 

exp[S^ 2 ->] = IJrxp/' . 
fc=i 

and as a result, the moment generating function can be decomposed to a product of 
components of exp[/ifc] like (A.4) with the following relations 

1 

exp[6(f)] 
1, 

cxp[/ fe ], 

□ 

Proof, of Theorem 3.2 

The idea of proving the group independence of multivariate Bernoulli variables are sim- 
ilar to Theorem 3.1. Instead of decomposing the moment generating function to products 
of fik j we only have to separate them into groups with each only involving the dependent 
random variables. That is to say, the moment generating function with two separately 
independent nodes in the multivariate Bernoulli should have the form 

ip(fi lt . . . , (i K ) = (c*o + "i cx p[Mi] H ha r cxp[^ r ]) ■ (/3 + /?i cxp[/i r+ i] H ^ f3 s exp[(i K }). 

Matching the corresponding coefficients of this separable moment generating function 
and the natural parameters leads to the conclusion (3.10). □ 
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